Abstract

We analyze the convergence rate of the alternating direction method of multipliers (ADMM)
for minimizing the sum of two or more nonsmooth convex separable functions subject to linear 
constraints. Previous analysis of the ADMM typically assumes that the objective function is the 
sum of only two convex functions defined on two separable blocks of variables even though the 
algorithm works well in numerical experiments for three or more blocks. Moreover, there has 
been no rate of convergence analysis for the ADMM without strong convexity. In this paper, we 
establish the global linear convergence of the ADMM for minimizing the sum of any number of 
convex separable functions. This result settles a key question regarding the convergence of the 
ADMM when the number of blocks is more than two or if the strong convexity is absent. It also 
implies the linear convergence of the ADMM for several contemporary applications including 
LASSO, Group LASSO and Sparse Group LASSO without any strong convexity assumption. 
Our proof is based on estimating the distance from a dual feasible solution to the optimal dual 
solution set by the norm of a certain proximal residual. 
1 Introduction

Consider the problem of minimizing a separable nonsmooth convex function subject to linear equality
constraints:

$$
\begin{array}{ll}
\text{minimize} & f(x) = f_1(x_1) + f_2(x_2) + \cdots + f_K(x_K) \\
\text{subject to} & Ex = E_1 x_1 + E_2 x_2 + \cdots + E_K x_K = q,
\end{array} \tag{1.1}
$$

where each $f_k$ is a nonsmooth convex function (possibly with extended values), $x = (x_1^T, \ldots, x_K^T)^T \in \Re^n$
is a partition of the optimization variable $x$, $E = (E_1, E_2, \ldots, E_K) \in \Re^{m \times n}$ is an appropriate
partition of matrix $E$ (consistent with the partition of $x$), and $q \in \Re^m$ is a vector. Notice that the
model (1.1) can easily accommodate general linear inequality constraints $Ex \ge q$ by adding one
extra block. In particular, we can introduce a slack variable $x_{K+1} \ge 0$ and rewrite the inequality
constraint as $Ex - x_{K+1} = q$. The constraint $x_{K+1} \ge 0$ can be enforced by adding a new convex
component function $f_{K+1}(x_{K+1}) = i_{\Re^m_+}(x_{K+1})$ to the objective function $f(x)$, where $i_{\Re^m_+}(x_{K+1})$ is
the indicator function for the nonnegative orthant $\Re^m_+$:

$$
i_{\Re^m_+}(x_{K+1}) =
\begin{cases}
0, & \text{if } x_{K+1} \ge 0 \text{ (entrywise)}, \\
\infty, & \text{otherwise.}
\end{cases}
$$

In this way, the inequality constrained problem with K blocks is reformulated as an equivalent
equality constrained convex minimization problem with K + 1 blocks.

Optimization problems of the form (1.1) arise in many emerging applications involving structured
convex optimization. For instance, in compressive sensing applications, we are given an
observation matrix $A$ and a noisy observation vector $b \approx Ax$. The goal is to estimate the sparse
vector $x$ by solving the following $\ell_1$ regularized linear least squares problem:

$$
\begin{array}{ll}
\text{minimize} & \|y\|^2 + \lambda \|x\|_1 \\
\text{subject to} & Ax + y = b,
\end{array}
$$

where $\lambda > 0$ is a penalty parameter. Clearly, this is a structured convex optimization problem of
the form (1.1) with K = 2. If the variable $x$ is further constrained to be nonnegative, then the
corresponding compressive sensing problem can be formulated as a three block (K = 3) convex
separable optimization problem (1.1) by introducing a slack variable. Similarly, in the stable
version of the robust principal component analysis (PCA) [55], we are given an observation matrix
$M \in \Re^{m \times n}$ which is a noise-corrupted sum of a low rank matrix $L$ and a sparse matrix $S$. The
goal is to recover $L$ and $S$ by solving the following nonsmooth convex optimization problem

$$
\begin{array}{ll}
\text{minimize} & \|L\|_* + \rho \|S\|_1 + \lambda \|Z\|_F^2 \\
\text{subject to} & L + S + Z = M,
\end{array}
$$
where $\|\cdot\|_*$ denotes the matrix nuclear norm (defined as the sum of the matrix singular values),
while $\|\cdot\|_1$ and $\|\cdot\|_F$ denote, respectively, the $\ell_1$ and the Frobenius norm of a matrix (equal to the
standard $\ell_1$ and $\ell_2$ vector norms when the matrix is viewed as a vector). In the above formulation,
$Z$ denotes the noise matrix, and $\rho$, $\lambda$ are some fixed penalty parameters. It is easily seen that the
stable robust PCA problem corresponds to the three block case K = 3 in the problem (1.1) with
$x = (L, S, Z)$ and

$$f_1(L) = \|L\|_*, \quad f_2(S) = \rho \|S\|_1, \quad f_3(Z) = \lambda \|Z\|_F^2, \tag{1.2}$$

while the coupling linear constraint is given by $L + S + Z = M$. In image processing applications
where the low rank matrix $L$ is additionally constrained to be nonnegative, the above problem
can be reformulated as

$$
\begin{array}{ll}
\text{minimize} & \|L\|_* + \rho \|S\|_1 + \lambda \|Z\|_F^2 + i_{\Re^{m \times n}_+}(C) \\
\text{subject to} & L + S + Z = M, \quad L - C = 0,
\end{array}
$$

where $C$ is a slack matrix variable of the same size as $L$, and $i_{\Re^{m \times n}_+}(\cdot)$ is the indicator function for
the nonnegative orthant $\Re^{m \times n}_+$. In this case, the stable robust PCA problem is again in the form
of (1.1). In particular, it has 4 block variables $(L, S, Z, C)$ and the first three convex functions are
the same as in (1.2), while the fourth convex function is given by $f_4(C) = i_{\Re^{m \times n}_+}(C)$. The coupling
linear constraints are $L + S + Z = M$, $L - C = 0$. Other applications of the form (1.1) include the
latent variable Gaussian graphical model selection problem, see [9].
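
To make the correspondence with (1.1) concrete, the constraint data of the two robust PCA formulations can be written out explicitly. The following is only a sketch: the vectorization operator $\mathrm{vec}(\cdot)$ and the identity matrix $I$ of size $mn$ are notational assumptions introduced here, not symbols used elsewhere in the paper.

```latex
% Three-block stable robust PCA: x = (vec(L), vec(S), vec(Z))
E = \begin{pmatrix} I & I & I \end{pmatrix}, \qquad q = \mathrm{vec}(M).

% Four-block version with the nonnegativity slack C:
% x = (vec(L), vec(S), vec(Z), vec(C))
E = \begin{pmatrix} I & I & I & 0 \\ I & 0 & 0 & -I \end{pmatrix}, \qquad
q = \begin{pmatrix} \mathrm{vec}(M) \\ 0 \end{pmatrix}.
```

In both cases every block $E_k$ has full column rank, since each block column contains an identity matrix (possibly with a sign).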

A popular approach to solving the separable convex optimization problem (1.1) is to attach a
Lagrange multiplier vector $y$ to the linear constraints $Ex = q$ and add a quadratic penalty, thus
obtaining an augmented Lagrangian function of the form

$$L(x; y) = f(x) + \langle y, q - Ex \rangle + \frac{\rho}{2} \|q - Ex\|^2, \tag{1.3}$$

where $\rho \ge 0$ is a constant. The augmented dual function is given by

$$d(y) = \min_x \; f(x) + \langle y, q - Ex \rangle + \frac{\rho}{2} \|q - Ex\|^2 \tag{1.4}$$

and the dual problem (equivalent to (1.1) under mild conditions) is

$$\max_y \; d(y). \tag{1.5}$$

Moreover, if $\rho > 0$, then $Ex$ is constant over the set of minimizers of (1.4) (see Lemma 2.1 in
Section 2). This implies that the dual function $d(y)$ is differentiable with

$$\nabla d(y) = q - E x(y),$$

where $x(y)$ is a minimizer of (1.4). Given the differentiability of $d(y)$, it is natural to consider the
following dual ascent method to solve the primal problem (1.1):

$$y := y + \alpha \nabla d(y) = y + \alpha (q - E x(y)), \tag{1.6}$$
where $\alpha > 0$ is a suitably chosen stepsize. Such a dual ascent strategy is well suited for structured
convex optimization problems that are amenable to decomposition. For example, if the objective
function $f$ is separable (i.e., of the form given in (1.1)) and if we select $\rho = 0$, then the minimization
in (1.4) decomposes into K independent minimizations whose solutions frequently can be obtained
in a simple form. In addition, the iterations can be implemented in a manner that exploits the
sparsity structure of the problem and, in certain network cases, achieve a high degree of parallelism.
Popular choices for the ascent methods include (single) coordinate ascent (see [31 [10J, [32l [38], SQl 
H2H3HE2I), gradient ascent (see [321I3Q1EI]) and gradient projection [231EI]. (See [5l[32ll37] for 
additional references.) 
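
The decomposition for $\rho = 0$ can be illustrated on a toy instance. The sketch below is not from the paper: the problem data, the function names, and the step size are illustrative assumptions. It takes scalar blocks with $f_k(x_k) = \frac{1}{2}(x_k - c_k)^2 + \lambda |x_k|$ and a single constraint $\sum_k e_k x_k = q$, so that each of the K independent minimizations reduces to a soft-thresholding step.

```python
def soft_threshold(v, t):
    """Minimizer over u of t*|u| + 0.5*(u - v)**2 for a scalar v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def dual_ascent(c, e, q, lam, alpha=0.3, iters=200):
    """Dual ascent (1.6) with rho = 0 on a toy problem with scalar blocks.

    Toy problem (an illustrative assumption):
        minimize   sum_k 0.5*(x_k - c[k])**2 + lam*|x_k|
        subject to sum_k e[k]*x_k = q
    """
    K = len(c)
    y = 0.0
    x = [0.0] * K
    for _ in range(iters):
        # With rho = 0 the Lagrangian separates, so every block is
        # minimized independently of the others.
        x = [soft_threshold(c[k] + y * e[k], lam) for k in range(K)]
        # Gradient step on the dual function: grad d(y) = q - E x(y).
        y += alpha * (q - sum(e[k] * x[k] for k in range(K)))
    return x, y


x, y = dual_ascent(c=[1.0, -2.0, 3.0], e=[1.0, 1.0, 1.0], q=1.0, lam=0.5)
```

For this instance the optimality conditions can be solved by hand, giving a multiplier of $-1/6$, which the iteration approaches. The step size 0.3 is chosen below $2/\sum_k e_k^2$ for this particular data and would need to change for other inputs.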

For large scale optimization problems, it is numerically advantageous to select $\rho > 0$. Unfortunately,
this also introduces variable coupling in the augmented Lagrangian (1.3), which makes
the exact minimization step in (1.4) no longer decomposable across variable blocks even if $f$ has
a separable structure. In this case, it is more economical to minimize (1.4) inexactly by updating
the components of $x$ cyclically via the coordinate descent method. In particular, we can apply
the Gauss-Seidel strategy to inexactly minimize (1.4), and then update the multiplier $y$ using an
approximate optimal solution of (1.4) in a manner similar to (1.6). The resulting algorithm is
called the Alternating Direction Method of Multipliers (ADMM) and is summarized as follows
(see [17-20]). In the general context of sums of monotone operators, the work of [16] describes a
large family of splitting methods for $K \ge 3$ blocks which, when applied to the dual, result in similar
but not identical methods to the ADMM algorithm (1.7) below.

Alternating Direction Method of Multipliers (ADMM)

At each iteration $r \ge 1$, we first update the primal variable blocks in the Gauss-Seidel
fashion and then update the dual multiplier using the updated primal
variables:

$$
\begin{aligned}
x_k^{r+1} &= \arg\min_{x_k} L(x_1^{r+1}, \ldots, x_{k-1}^{r+1}, x_k, x_{k+1}^r, \ldots, x_K^r; y^r), \quad k = 1, 2, \ldots, K, \\
y^{r+1} &= y^r + \alpha (q - E x^{r+1}) = y^r + \alpha \Big( q - \sum_{k=1}^K E_k x_k^{r+1} \Big),
\end{aligned} \tag{1.7}
$$

where $\alpha > 0$ is the step size for the dual update.
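
As a minimal sketch of iteration (1.7), the code below runs the Gauss-Seidel sweep and the dual update on a three-block toy problem with scalar blocks. The problem data, the function names, and the values of $\rho$ and $\alpha$ are illustrative assumptions, not choices made in the paper. Each block objective is $\frac{1}{2}(x_k - c_k)^2 + \lambda |x_k|$, so the block minimization of the augmented Lagrangian has a closed form.

```python
def soft_threshold(v, t):
    """Minimizer over u of t*|u| + 0.5*(u - v)**2 for a scalar v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def admm_scalar_blocks(c, e, q, lam, rho=1.0, alpha=0.1, iters=2000):
    """Gauss-Seidel primal sweep followed by the dual update, as in (1.7).

    Toy problem (an illustrative assumption):
        minimize   sum_k 0.5*(x_k - c[k])**2 + lam*|x_k|
        subject to sum_k e[k]*x_k = q
    """
    K = len(c)
    x = [0.0] * K
    y = 0.0
    for _ in range(iters):
        for k in range(K):
            # Contribution of the other blocks to Ex; entries j < k
            # already hold their updated values from this sweep.
            s = sum(e[j] * x[j] for j in range(K) if j != k)
            # Stationarity of the augmented Lagrangian in x_k:
            # (1 + rho*e_k^2) x_k + lam*sign(x_k) = c_k + y*e_k - rho*e_k*(s - q)
            v = c[k] + y * e[k] - rho * e[k] * (s - q)
            x[k] = soft_threshold(v, lam) / (1.0 + rho * e[k] ** 2)
        # Dual update with step size alpha.
        y += alpha * (q - sum(e[k] * x[k] for k in range(K)))
    return x, y


x, y = admm_scalar_blocks(c=[1.0, -2.0, 3.0], e=[1.0, 1.0, 1.0], q=1.0, lam=0.5)
residual = abs(1.0 - sum(x))
```

For this instance the optimality conditions give the primal solution $(1/3, -5/3, 7/3)$ with multiplier $-1/6$, and the iterates approach it. The dual step is deliberately smaller than $\rho$ here; how small it must be in general is exactly what the analysis in the paper addresses.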

Notice that if there is only one block (K = 1), then the ADMM reduces to the standard 
augmented Lagrangian method of multipliers pQ for which the global convergence is well understood. 
In particular, it is known that, under mild assumptions on the problem, this type of dual gradient
ascent method generates a sequence of iterates whose limit points must be optimal solutions of the
original problem (see [8, 47, 49]). For the special case of ordinary network flow problems, it is further
known that an associated sequence of dual iterates converges to an optimal solution of the dual
(see [1]). The rate of convergence of dual ascent methods has been studied in [35], which
showed that, under mild assumptions on the problem, the distance to the optimal dual solution
set from any $y \in \Re^m$ near the set is bounded above by the dual optimality 'residual' $\|\nabla d(y)\|$. By
using this bound, it can be shown that a number of ascent methods, including coordinate ascent
methods and a gradient projection method, converge at least linearly when applied to solve the
dual problem (see [33], [34]; also see [2|llip28| for related analysis). (Throughout this paper, by 'linear
convergence' we mean root-linear convergence (denoted by R-linear convergence) in the sense of
Ortega and Rheinboldt [39].)

When there are two blocks (K = 2), the convergence of the ADMM was studied in the context of the
Douglas-Rachford splitting method [13-15] for finding a zero of the sum of two maximal monotone
operators. It is known that in this case every limit point of the iterates is an optimal solution
of the problem. The recent work of [2 HI221I26] has shown that the objective values generated
by the ADMM algorithm and its accelerated version converge at a rate of $O(1/r)$ and $O(1/r^2)$
respectively. Moreover, if the objective function $f(x)$ is strongly convex and the constraint matrix
$E$ is row independent, then the ADMM is known to converge linearly to the unique minimizer
of (1.1). [One notable exception to the strong convexity requirement is in the special case of
linear programming for which the ADMM is linearly convergent [H].] More recent convergence
rate analysis of the ADMM still requires at least one of the component functions ($f_1$ or $f_2$) to
be strongly convex and have a Lipschitz continuous gradient. Under these and additional rank
conditions on the constraint matrix $E$, some linear convergence rate results can be obtained for a
subset of primal and dual variables in the ADMM algorithm (or its variant); see [6" lll2p24| . However,
when there are more than two blocks involved ($K \ge 3$), the convergence (or the rate of convergence)
of the ADMM method is unknown, and this has been a key open question for several decades. The
recent work [36] describes a list of novel applications of the ADMM with $K \ge 3$ and strongly
motivates the need to analyze the convergence of the ADMM in the multi-block case. The recent
monograph [7] contains more details of the history, convergence analysis and applications of the
ADMM and related methods.

A main contribution of this paper is to establish the global (linear) convergence of the ADMM 
method for a class of convex objective functions involving any number of blocks (K is arbitrary). 
The key requirement for the global (linear) convergence is the satisfaction of a certain error bound 
condition that is similar to that used in the analysis of [35]. This error bound estimates the 
distance from an iterate to the optimal solution set in terms of a certain proximity residual. The 
class of objective functions known to satisfy this error bound condition includes many of
the compressive sensing applications, such as LASSO [36], Group LASSO [52] or Sparse Group 
LASSO [53]. 

In our notation, all vectors are column vectors and $\Re^n$ denotes the n-dimensional Euclidean
space. For any vector $x \in \Re^n$, we denote by $x_i$ the $i$th coordinate of $x$ and, for any $I \subseteq \{1, \ldots, n\}$,
by $x_I$ the vector obtained after removing from $x$ those $x_i$ with $i \notin I$. We also denote by $\|x\|$ the
usual Euclidean norm of $x$, i.e., $\|x\| = \sqrt{\langle x, x \rangle}$ with $\langle x, y \rangle = \sum_i x_i y_i$. For any $h \times k$ matrix $A$,
we denote by $A_i$ the $i$th row of $A$, and by $A_I$ the submatrix of $A$ obtained by removing all rows
$A_i$ with $i \notin I$. For any function $h$ with gradient $\nabla h$, the notations $\nabla_i h$ and $\nabla_I h$ carry analogous
meaning. For any closed convex set $X$ and any vector $x$ in the same space, we denote by $[x]_X^+$ the
orthogonal projection of $x$ onto $X$.

2 Technical Preliminaries 

Let $f$ be a closed proper convex function in $\Re^n$, let $E$ be an $m \times n$ matrix, let $q$ be a vector in $\Re^m$.
Let dom $f$ denote the effective domain of $f$ and let int(dom $f$) denote the interior of dom $f$. We
make the following standing assumptions regarding $f$:

Assumption A.

(a) The global minimum of (1.1) is attained and so is its dual optimal value. The intersection
int(dom $f$) $\cap$ $\{x \mid Ex = q\}$ is nonempty.

(b) $f = f_1(x_1) + f_2(x_2) + \cdots + f_K(x_K)$, with each $f_k$ further decomposable as

$$f_k(x_k) = g_k(A_k x_k) + \langle b_k, x_k \rangle + h_k(x_k)$$

where $g_k$ and $h_k$ are both convex and continuous over their domains, and the $A_k$'s are some given
matrices (not necessarily full column rank, and can be zero), while the $b_k$'s are given vectors of
appropriate dimensions.

(c) Each $g_k$ is strictly convex and continuously differentiable on int(dom $g_k$) with a uniform
Lipschitz continuous gradient

$$\|\nabla g_k(z_k) - \nabla g_k(z_k')\| \le L \|z_k - z_k'\|, \quad \forall\, z_k, z_k' \in \text{int(dom } g_k),$$

where $L > 0$ is a constant. Moreover, $\|\nabla g_k(z_k)\| \to \infty$ whenever $z_k$ approaches the boundary
of int(dom $f$) or $\|z_k\| \to \infty$.

We remark that the inclusion of a linear term $\langle b_k, x_k \rangle$ in Assumption A(b) is needed to cover
cases where $b_k$ lies outside the row span of $A_k$. In the subsequent analysis, however, we shall assume
$b_k = 0$ for all $k$ to simplify the notation. All the ensuing proofs and results remain true with
trivial modifications for the case $b_k \ne 0$.
Under Assumption A, both the primal optimum and the dual optimum values of (1.1) are
attained and are equal (i.e., strong duality holds for (1.1)) so that

$$d^* = \max_y \min_x L(x; y) = \max_y \min_x \Big( f(x) + \langle y, q - Ex \rangle + \frac{\rho}{2} \|Ex - q\|^2 \Big) = \min_{Ex = q} f(x),$$

where $d^*$ is the optimal value of the dual of (1.1).

Roughly speaking, Assumption A requires that the smooth part of $f$ (i.e., the $g_k$'s), in addition
to satisfying certain regularity conditions, be a co-finite, strictly convex essentially smooth function
or, in the terminology of Rockafellar [42, Sec. 26], a co-finite convex function of the Legendre type.
In general, Assumption A(c) is satisfied whenever the smooth part of $f$ is strongly convex and twice
differentiable or whenever its Hessian is positive definite everywhere on the interior of its effective
domain.

Although Assumption A may seem restrictive, there are a number of important special cases
that satisfy this assumption. These include (i) strictly convex quadratic programs (see [32]), (ii)
certain problems of matrix balancing and image reconstruction, where $f(x)$ is the entropy function
$\sum_{j=1}^n x_j \ln x_j$ (see ), (iii) a problem of optimal routing on data networks, where $f(x)$ is
the inverse barrier function $\sum_{j=1}^n 1/(c_j - x_j)$ with $c_j > 0$ (see [4]), and (iv) the Hazen-Williams
model of flow through pipe networks, where $f(x)$ is the power function $\sum_{j=1}^n a_j (x_j)^c$ with $a_j > 0$
and $c \approx 2.85$ (see 01] ).

Under Assumption A, there may still be multiple optimal solutions for both the primal problem
(1.1) and its dual problem. We first claim that the dual functional

$$d(y) = \min_x L(x; y) = \min_x \; f(x) + \langle y, q - Ex \rangle + \frac{\rho}{2} \|q - Ex\|^2, \tag{2.1}$$

is differentiable everywhere. Let $X(y)$ denote the set of optimal solutions for (2.1).

Lemma 2.1 For any $y \in \Re^m$, both $Ex$ and $A_k x_k$, $k = 1, 2, \ldots, K$, are constant over $X(y)$. Moreover,
the dual function $d(y)$ is differentiable everywhere and

$$\nabla d(y) = q - E x(y),$$

where $x(y) \in X(y)$ is any minimizer of (2.1).

Proof. Fix $y \in \Re^m$. We first show that $Ex$ is invariant over $X(y)$. Suppose the contrary, so that
there exist two optimal solutions $x$ and $x'$ from $X(y)$ with the property that $Ex \ne Ex'$. Then, we
have

$$d(y) = L(x; y) = L(x'; y).$$

Due to the convexity of $L(x; y)$ with respect to the variable $x$, the solution set $X(y)$ must be convex,
implying $\bar{x} = (x + x')/2 \in X(y)$. By the convexity of $f(x)$, we have

$$\frac{1}{2} \big[ (f(x) + \langle y, q - Ex \rangle) + (f(x') + \langle y, q - Ex' \rangle) \big] \ge f(\bar{x}) + \langle y, q - E\bar{x} \rangle.$$

Moreover, by the strict convexity of $\|\cdot\|^2$ and the assumption $Ex \ne Ex'$, we have

$$\frac{1}{2} \big( \|Ex - q\|^2 + \|Ex' - q\|^2 \big) > \|E\bar{x} - q\|^2.$$

Multiplying this inequality by $\rho/2$ and adding it to the previous inequality yields

$$\frac{1}{2} \big[ L(x; y) + L(x'; y) \big] > L(\bar{x}; y),$$

which further implies

$$d(y) > L(\bar{x}; y).$$

This contradicts the definition $d(y) = \min_x L(x; y)$. Thus, $Ex$ is invariant over $X(y)$. Notice that
$d(y)$ is a concave function and its subdifferential is given by

$$\partial d(y) = \text{Closure of the convex hull } \{\, q - E x(y) \mid x(y) \in X(y) \,\}.$$

Since $E x(y)$ is invariant over $X(y)$, the subdifferential $\partial d(y)$ is a singleton. By Danskin's Theorem,
this implies that $d(y)$ is differentiable and the gradient is given by $\nabla d(y) = q - E x(y)$, for any
$x(y) \in X(y)$.

A similar argument (using the strict convexity of $g_k$) shows that $A_k x_k$ is also invariant over
$X(y)$. The proof is complete. Q.E.D.

To show the linear convergence of the ADMM method, we need a local error bound around
the optimal solution set $X(y)$. To describe this local error bound, we first define the notion of a
proximity operator. Let $h : \text{dom}(h) \to \Re$ be a (possibly nonsmooth) convex function. For every
$x \in \text{dom}(h)$, the proximity operator of $h$ is defined as

$$\text{prox}_h(x) = \arg\min_u \; h(u) + \frac{1}{2} \|x - u\|^2.$$

Notice that if $h(x)$ is the indicator function of a closed convex set $X$, then

$$\text{prox}_h(x) = \text{proj}_X(x),$$

so the proximity operator is a generalization of the projection operator. In particular, it is known
that the proximity operator satisfies the nonexpansiveness property:

$$\|\text{prox}_h(x) - \text{prox}_h(x')\| \le \|x - x'\|, \quad \forall\, x, x'. \tag{2.2}$$
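
As a small illustration of the definition and of property (2.2), the sketch below evaluates the proximity operator of $h(u) = t\|u\|_1$, which is componentwise soft-thresholding, and checks nonexpansiveness on random pairs of points. The function names, the dimension, and the sampling range are illustrative assumptions.

```python
import random


def prox_l1(x, t=1.0):
    """Proximity operator of h(u) = t*||u||_1: componentwise soft-thresholding."""
    return [max(abs(v) - t, 0.0) * (1.0 if v > 0 else -1.0) for v in x]


def dist(a, b):
    """Euclidean distance between two vectors given as lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


random.seed(0)
nonexpansive = True
for _ in range(1000):
    a = [random.uniform(-3.0, 3.0) for _ in range(5)]
    b = [random.uniform(-3.0, 3.0) for _ in range(5)]
    # Property (2.2): the prox images are never farther apart than the inputs.
    if dist(prox_l1(a), prox_l1(b)) > dist(a, b) + 1e-12:
        nonexpansive = False
```

A random check of this kind does not prove (2.2); it only confirms that the inequality is not violated on the sampled pairs.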
The proximity operator can be used to characterize the optimality condition for a nonsmooth
convex optimization problem. Suppose a convex function $f$ is decomposed as $f(x) = g(Ax) + h(x)$
where $g$ is strongly convex and differentiable and $h$ is a convex (possibly nonsmooth) function. Then
we can define the proximal gradient of $f$ with respect to $h$ as

$$\tilde{\nabla} f(x) := x - \text{prox}_h\big(x - \nabla (f(x) - h(x))\big) = x - \text{prox}_h\big(x - A^T \nabla g(Ax)\big).$$

If $h = 0$, then the proximal gradient $\tilde{\nabla} f(x) = \nabla f(x)$. In general, $\tilde{\nabla} f(x)$ can be used as the
(standard) gradient of $f$ for the nonsmooth minimization $\min_{x \in X} f(x)$. For example, $\tilde{\nabla} f(x^*) = 0$
iff $x^*$ is a global minimizer.
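
The role of the proximal gradient as an optimality measure can be seen on a small example. The sketch below is an illustration under assumed data: it takes $g(z) = \frac{1}{2}\|z - b\|^2$ and $h = \lambda\|\cdot\|_1$ with a hypothetical 2 x 2 matrix $A$, finds the minimizer by repeatedly subtracting the proximal gradient (a unit-step proximal gradient method, which converges here because the largest eigenvalue of $A^T A$ is below 2), and compares the residual at the minimizer with the residual at a perturbed point.

```python
def soft(v, t):
    """Scalar soft-thresholding, the prox of t*|.|."""
    return max(abs(v) - t, 0.0) * (1.0 if v > 0 else -1.0)


# Illustrative data (assumptions, not from the paper).
A = [[0.8, 0.2], [0.0, 0.6]]
b = [1.0, -0.3]
lam = 0.1


def grad_smooth(x):
    """A^T grad g(Ax) = A^T (A x - b) for g(z) = 0.5*||z - b||^2."""
    r = [sum(A[i][j] * x[j] for j in range(2)) - b[i] for i in range(2)]
    return [sum(A[i][j] * r[i] for i in range(2)) for j in range(2)]


def prox_grad(x):
    """Proximal gradient x - prox_h(x - A^T grad g(Ax)) with h = lam*||.||_1."""
    g = grad_smooth(x)
    return [x[j] - soft(x[j] - g[j], lam) for j in range(2)]


def norm(v):
    return sum(t * t for t in v) ** 0.5


# Fixed-point iteration x <- x - prox_grad(x).
x = [0.0, 0.0]
for _ in range(500):
    d = prox_grad(x)
    x = [x[j] - d[j] for j in range(2)]

res_at_min = norm(prox_grad(x))
res_away = norm(prox_grad([x[0] + 0.5, x[1] + 0.5]))
```

The residual vanishes (up to rounding) at the computed minimizer and is clearly nonzero away from it, which is the behavior the error bounds below quantify.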

For the Lagrangian minimization problem (2.1) and under Assumption A, the work of [35, 48, 53]
suggests that the size of the proximal gradient

$$
\begin{aligned}
\tilde{\nabla}_x L(x; y) &:= x - \text{prox}_h\big(x - \nabla_x (L(x; y) - h(x))\big) \\
&= x - \text{prox}_h\big(x - A^T \nabla g(Ax) + E^T y - \rho E^T (Ex - q)\big)
\end{aligned} \tag{2.3}
$$

can be used to upper bound the distance to the optimal solution set $X(y)$ of (2.1). Here

$$h(x) = \sum_{k=1}^K h_k(x_k), \qquad g(Ax) = \sum_{k=1}^K g_k(A_k x_k)$$

represent the nonsmooth and the smooth parts of $f(x)$ respectively.

Assumption B. For any $\delta > 0$, there exists a positive scalar $\tau$ such that, for any $(x, y)$ satisfying
$\|x\| + \|y\| \le \delta$, the following error bounds hold

$$\text{dist}(y, Y^*) = \|y - y^*\| \le \tau \|\nabla d(y)\|,$$

and

$$\text{dist}(x, X(y)) \le \tau \|\tilde{\nabla}_x L(x; y)\|,$$

where $Y^*$ denotes the dual optimal solution set and the proximal gradient $\tilde{\nabla}_x L(x; y)$ is given by
(2.3). Moreover, the constant $\tau$ is independent of the choice of $y$ and $x$.

The next lemma says that if the nonsmooth part of $f$ takes a certain form, then Assumption B
holds.

Lemma 2.2 Suppose Assumption A holds. Then the local error bounds in Assumption B hold if
the nonsmooth component functions $h_k(x_k)$ of the objective function satisfy one of the following
properties:

(1) Either the epigraph of $h_k(x_k)$ is a polyhedral set;
(2) Or $h_k(x_k) = \lambda_k \|x_k\|_1 + \sum_J w_J \|x_{k,J}\|_2$, where $x_k = (\ldots, x_{k,J}, \ldots)$ is a partition of $x_k$ with $J$
being the partition index.

The proof of Lemma 2.2 is identical to that of [35, 48, 53] for any fixed $y$. The only new ingredient
in Lemma 2.2 is the additional claim that the constants $\delta$, $\tau$ are both independent of the choice of
$y$. This property follows directly from a similar property of Hoffman's error bound (on which the
error bounds of [35, 48, 53] are based) for a feasible linear system $P := \{x \mid Ax \le b\}$:

$$\text{dist}(x, P) \le \tau \|[Ax - b]_+\|, \quad \forall\, x \in \Re^n,$$

where $\tau$ is independent of $b$. In fact, a careful check of the proofs of [35, 48, 53] shows that the
corresponding error constants $\delta$ and $\tau$ for the augmented Lagrangian function $L(x; y)$ can indeed be
made independent of $y$. We omit the proof of Lemma 2.2 for space considerations.

By using Lemma 2.1 we show below a Lipschitz continuity property of $\nabla d(y)$, for $y$ over any
level set of $d$.

Lemma 2.3 Fix any scalar $\eta \le f^*$ and let $U = \{\, y \in \Re^m \mid d(y) \ge \eta \,\}$. Then there holds

$$\|\nabla d(y') - \nabla d(y)\| \le \frac{1}{\rho} \|y' - y\|, \quad \forall\, y' \in U,\; y \in U.$$

Proof. Fix any $y$ and $y'$ in $U$. Let $x = x(y)$ and $x' = x(y')$ be two minimizers of $L(x; y)$ and
$L(x; y')$ respectively. By convexity, we have

$$\nabla f(x) - E^T y + \rho E^T (Ex - q) = 0 \quad \text{and} \quad \nabla f(x') - E^T y' + \rho E^T (Ex' - q) = 0,$$

where $\nabla f(x)$ and $\nabla f(x')$ are some subgradient vectors in the subdifferential $\partial f(x)$ and $\partial f(x')$
respectively. Thus, we have

$$\langle \nabla f(x) - E^T y + \rho E^T (Ex - q),\, x' - x \rangle = 0$$

and

$$\langle \nabla f(x') - E^T y' + \rho E^T (Ex' - q),\, x - x' \rangle = 0.$$

Adding the above two equalities yields

$$\langle \nabla f(x) - \nabla f(x') + E^T (y' - y) - \rho E^T E (x' - x),\, x' - x \rangle = 0.$$

Upon rearranging terms and using the convexity property

$$\langle \nabla f(x') - \nabla f(x),\, x' - x \rangle \ge 0,$$

we get

$$\langle y' - y,\, E(x' - x) \rangle = \langle \nabla f(x') - \nabla f(x),\, x' - x \rangle + \rho \|E(x' - x)\|^2 \ge \rho \|E(x' - x)\|^2.$$

Thus, $\rho \|E(x' - x)\| \le \|y' - y\|$, which together with $\nabla d(y') - \nabla d(y) = E(x - x')$ (cf. Lemma 2.1)
yields

$$\|\nabla d(y') - \nabla d(y)\| = \|E(x' - x)\| \le \frac{1}{\rho} \|y - y'\|.$$

The proof is complete. Q.E.D.

In our analysis of ADMM, we will also need an error bound for the dual function $d(y)$. Notice
that a $y \in \Re^m$ solves (1.5) if and only if $y$ satisfies the system of nonlinear equations

$$\nabla d(y) = 0.$$

This suggests that the norm of the 'residual' $\|\nabla d(y)\|$ may be a good estimate of how close $y$ is
to solving (1.5). We show that this is true in the sense that, for all $y$ such that the above residual
is small and $d(y)$ is bounded below, the distance from $y$ to $Y^*$ (the dual optimal solution set), defined
by

$$\text{dist}(y, Y^*) = \min_{y^* \in Y^*} \|y - y^*\|,$$

is bounded above by the norm of this residual. Error bounds like this are similar to that in
Assumption B and have been studied previously by Pang [41] and by Mangasarian and Shiau [37],
though in different contexts. The above error bound is 'local' in that it holds only for those $y$ that
are bounded or near $Y^*$ (as opposed to a 'global' error bound which would hold for all $y$ in $\Re^m$).
This local error bound was established in [35] (see Corollary 4.1 therein), which does not require $f$
to be co-finite.

Assumption C. Each submatrix $E_k$ has full column rank.

Under Assumption C, the augmented Lagrangian function $L(x; y)$ (cf. (1.3)) is strongly convex
with respect to each subvector $x_k$. As a result, each alternating minimization iteration of ADMM
(1.7)

$$x_k^{r+1} = \arg\min_{x_k} L(x_1^{r+1}, \ldots, x_{k-1}^{r+1}, x_k, x_{k+1}^r, \ldots, x_K^r; y^r), \quad k = 1, \ldots, K,$$

has a unique optimal solution. Thus the sequence of iterates $\{x^r\}$ of the ADMM is well defined.
The following lemma shows that the alternating minimization of the Lagrangian function gives a
sufficient descent of the Lagrangian function value.

Lemma 2.4 Suppose Assumption C holds. Then, for any index $r$, we have

$$L(x^r; y^r) - L(x^{r+1}; y^r) \ge \gamma \|x^r - x^{r+1}\|^2, \tag{2.4}$$

where the constant $\gamma > 0$ is independent of $r$ and $y^r$.
Proof. By Assumption C, the augmented Lagrangian function

$$L(x; y) = \sum_{k=1}^K f_k(x_k) + \Big\langle y,\, q - \sum_{k=1}^K E_k x_k \Big\rangle + \frac{\rho}{2} \Big\| q - \sum_{k=1}^K E_k x_k \Big\|^2$$

is strongly convex in each variable $x_k$ and has a uniform modulus $\rho \lambda_{\min}(E_k^T E_k) > 0$. Here, the
notation $\lambda_{\min}(\cdot)$ denotes the smallest eigenvalue of a symmetric matrix. This implies that, for each
$k$,

$$L(x; y) - L(x_1, \ldots, x_{k-1}, \bar{x}_k, x_{k+1}, \ldots, x_K; y) \ge \rho \lambda_{\min}(E_k^T E_k) \|x_k - \bar{x}_k\|^2, \tag{2.5}$$

for all $x$, where $\bar{x}_k$ is the minimizer of $\min_{x_k} L(x; y)$ (when all other variables $\{x_j\}_{j \ne k}$ are fixed).

Fix any index $r$. For each $k \in \{1, \ldots, K\}$, by ADMM (1.7), $x_k^{r+1}$ is the minimizer of
$L(x_1^{r+1}, \ldots, x_{k-1}^{r+1}, x_k, x_{k+1}^r, x_{k+2}^r, \ldots, x_K^r; y^r)$. It follows from (2.5) that

$$L(x_1^{r+1}, \ldots, x_{k-1}^{r+1}, x_k^r, \ldots, x_K^r; y^r) - L(x_1^{r+1}, \ldots, x_k^{r+1}, x_{k+1}^r, \ldots, x_K^r; y^r) \ge \gamma \|x_k^r - x_k^{r+1}\|^2, \quad \forall\, k, \tag{2.6}$$

where

$$\gamma = \rho \min_k \lambda_{\min}(E_k^T E_k)$$

is independent of $r$ and $y^r$. Summing this over $k$, we obtain the sufficient decrease condition

$$L(x^r; y^r) - L(x^{r+1}; y^r) \ge \gamma \|x^r - x^{r+1}\|^2.$$

This completes the proof of Lemma 2.4. Q.E.D.



To prove the linear convergence of the ADMM algorithm, we also need the following lemma
which bounds the size of the proximal gradient $\tilde{\nabla} L(x^r; y^r)$ at an iterate $x^r$.

Lemma 2.5 Suppose Assumption A holds. Let $\{x^r\}$ be generated by the ADMM algorithm (1.7).
Then there exists some constant $\sigma > 0$ (independent of $y^r$) such that

$$\|\tilde{\nabla} L(x^r; y^r)\| \le \sigma \|x^{r+1} - x^r\| \tag{2.7}$$

for all $r \ge 1$.



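Proof. Fix any $r \ge 1$ and any $1 \le k \le K$. According to the ADMM procedure (1.7), the variable
$x_k$ is updated as follows

$$x_k^{r+1} = \arg\min_{x_k} \Big[ h_k(x_k) + g_k(A_k x_k) - \langle y^r, E_k x_k \rangle + \frac{\rho}{2} \Big\| E_k x_k + \sum_{j<k} E_j x_j^{r+1} + \sum_{j>k} E_j x_j^r - q \Big\|^2 \Big].$$

The corresponding optimality condition can be written as

$$x_k^{r+1} = \text{prox}_{h_k} \Big[ x_k^{r+1} - A_k^T \nabla g_k(A_k x_k^{r+1}) + E_k^T y^r - \rho E_k^T \Big( \sum_{j \le k} E_j x_j^{r+1} + \sum_{j>k} E_j x_j^r - q \Big) \Big]. \tag{2.8}$$

Therefore, we have

$$
\begin{aligned}
&\Big\| x_k^{r+1} - \text{prox}_{h_k} \big( x_k^r - A_k^T \nabla g_k(A_k x_k^r) + E_k^T y^r - \rho E_k^T (E x^r - q) \big) \Big\| \\
&\quad = \Big\| \text{prox}_{h_k} \Big[ x_k^{r+1} - A_k^T \nabla g_k(A_k x_k^{r+1}) + E_k^T y^r - \rho E_k^T \Big( \sum_{j \le k} E_j x_j^{r+1} + \sum_{j>k} E_j x_j^r - q \Big) \Big] \\
&\qquad\quad - \text{prox}_{h_k} \big( x_k^r - A_k^T \nabla g_k(A_k x_k^r) + E_k^T y^r - \rho E_k^T (E x^r - q) \big) \Big\| \\
&\quad \le \Big\| (x_k^{r+1} - x_k^r) - A_k^T \big( \nabla g_k(A_k x_k^{r+1}) - \nabla g_k(A_k x_k^r) \big) - \rho E_k^T \sum_{j \le k} E_j (x_j^{r+1} - x_j^r) \Big\| \\
&\quad \le \|x_k^{r+1} - x_k^r\| + L \|A_k^T\| \|A_k\| \|x_k^{r+1} - x_k^r\| + \rho \|E_k^T\| \sum_{j \le k} \|E_j\| \|x_j^{r+1} - x_j^r\| \\
&\quad \le c \|x^{r+1} - x^r\|, \quad \text{for some } c > 0 \text{ independent of } y^r,
\end{aligned} \tag{2.9}
$$

where the first inequality follows from the nonexpansive property of the prox operator (2.2), and
the second inequality is due to the Lipschitz property of the gradient vector $\nabla g_k$ (cf. Assumption
A). Using this relation and the definition of the proximal gradient $\tilde{\nabla} L(x^r; y^r)$, we have

$$
\begin{aligned}
\|\tilde{\nabla}_{x_k} L(x^r; y^r)\| &= \Big\| x_k^r - \text{prox}_{h_k} \big( x_k^r - A_k^T \nabla g_k(A_k x_k^r) + E_k^T y^r - \rho E_k^T (E x^r - q) \big) \Big\| \\
&\le \|x_k^r - x_k^{r+1}\| + \Big\| x_k^{r+1} - \text{prox}_{h_k} \big( x_k^r - A_k^T \nabla g_k(A_k x_k^r) + E_k^T y^r - \rho E_k^T (E x^r - q) \big) \Big\| \\
&\le (c + 1) \|x^{r+1} - x^r\|, \quad \forall\, k = 1, 2, \ldots, K.
\end{aligned}
$$

This further implies that the entire proximal gradient vector can be bounded by $\|x^{r+1} - x^r\|$:

$$\|\tilde{\nabla} L(x^r; y^r)\| \le (c + 1) \sqrt{K} \|x^{r+1} - x^r\|.$$

Setting $\sigma = (c + 1)\sqrt{K}$ (which is independent of $y^r$) completes the proof. Q.E.D.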



3 Linear Convergence of ADMM

Let $d^*$ denote the dual optimal value and $\{x^r, y^r\}$ be the sequence generated by the ADMM method
(1.7). Further we denote

$$\Delta_d^r = d^* - d(y^r) \tag{3.1}$$

which represents the gap from dual optimality at the $r$-th iteration. The primal gap to optimality
at iteration $r$ is defined as

$$\Delta_p^r = L(x^{r+1}; y^r) - d(y^r), \quad r \ge 1. \tag{3.2}$$

Clearly, we have both $\Delta_d^r \ge 0$ and $\Delta_p^r \ge 0$ for all $r$. To establish the linear convergence of ADMM,
we need several lemmas to estimate the sizes of the primal and dual optimality gaps as well as their
respective decrease.

Let $X(y^r)$ denote the set of optimal solutions for the following optimization problem

$$\min_x L(x; y^r) = \min_x \; f(x) + \langle y^r, q - Ex \rangle + \frac{\rho}{2} \|Ex - q\|^2.$$

We denote

$$\bar{x}^r = \arg\min_{\bar{x} \in X(y^r)} \|\bar{x} - x^r\|.$$

We first bound the sizes of both the dual and primal optimality gaps.

Lemma 3.1 For any scalar $\delta > 0$, there exists a positive scalar $\tau'$ such that

$$\Delta_d^r \le \tau' \|\nabla d(y^r)\|^2 = \tau' \|E x(y^r) - q\|^2 \tag{3.3}$$

for any $y^r \in \Re^m$ with $\|y^r\| \le \delta$. Moreover, there exist positive scalars $\zeta$ and $\zeta'$ (independent of $y^r$)
such that

$$\Delta_p^r \le \zeta \|x^{r+1} - x^r\|^2 + \zeta' \|x^r - \bar{x}^r\|^2, \quad \text{for all } r \ge 1. \tag{3.4}$$

Proof. Fix any $y^r$, and let $y^*$ be the optimal dual solution closest to $y^r$. Then it follows from the
mean value theorem that there exists some $\tilde{y}$ in the line segment joining $y^r$ and $y^*$ such that

$$
\begin{aligned}
\Delta_d^r &= d(y^*) - d(y^r) \\
&= \langle \nabla d(\tilde{y}),\, y^* - y^r \rangle \\
&= \langle \nabla d(\tilde{y}) - \nabla d(y^*),\, y^* - y^r \rangle \\
&\le \|\nabla d(\tilde{y}) - \nabla d(y^*)\| \, \|y^* - y^r\| \\
&\le \frac{1}{\rho} \|\tilde{y} - y^*\| \, \|y^* - y^r\| \\
&\le \frac{1}{\rho} \|y^r - y^*\| \, \|y^* - y^r\| \\
&= \frac{1}{\rho} \|y^* - y^r\|^2,
\end{aligned}
$$

where the second inequality follows from Lemma 2.3. Recall from Assumption B that there exists
some $\tau$ such that

$$\text{dist}(y^r, Y^*) = \|y^r - y^*\| \le \tau \|\nabla d(y^r)\|.$$

Combining the above two inequalities yields

$$\Delta_d^r = d(y^*) - d(y^r) \le \tau' \|\nabla d(y^r)\|^2,$$

where $\tau' = \tau^2 / \rho$ is a constant. This establishes the bound on the size of the dual gap (3.3).

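It remains to prove the bound on the primal gap (3.4). For notational simplicity, let us separate
the smooth and nonsmooth parts of the augmented Lagrangian as follows

$$L(x; y) = g(Ax) + h(x) + \langle y, q - Ex \rangle + \frac{\rho}{2} \|q - Ex\|^2 := \bar{L}(x; y) + h(x).$$

Let $x_k^{r+1}$ denote the $k$-th subvector of the primal vector $x^{r+1}$. From the way that the variables
are updated (2.8), we have

$$
\begin{aligned}
x_k^{r+1} &= \text{prox}_{h_k} \Big[ x_k^{r+1} - \nabla_{x_k} \bar{L}\big( \{x_j^{r+1}\}_{j \le k}, \{x_j^r\}_{j>k}; y^r \big) \Big] \\
&= \text{prox}_{h_k} \Big[ x_k^r - \nabla_{x_k} \bar{L}(x^r; y^r) - x_k^r + x_k^{r+1} - \nabla_{x_k} \bar{L}\big( \{x_j^{r+1}\}_{j \le k}, \{x_j^r\}_{j>k}; y^r \big) + \nabla_{x_k} \bar{L}(x^r; y^r) \Big] \\
&:= \text{prox}_{h_k} \Big[ x_k^r - \nabla_{x_k} \bar{L}(x^r; y^r) - e_k^r \Big]
\end{aligned} \tag{3.5}
$$

where the gradient vector $\nabla_{x_k} \bar{L}\big( \{x_j^{r+1}\}_{j \le k}, \{x_j^r\}_{j>k}; y^r \big)$ can be explicitly expressed as

$$\nabla_{x_k} \bar{L}\big( \{x_j^{r+1}\}_{j \le k}, \{x_j^r\}_{j>k}; y^r \big) = A_k^T \nabla g_k(A_k x_k^{r+1}) - E_k^T y^r + \rho E_k^T \Big( \sum_{j \le k} E_j x_j^{r+1} + \sum_{j>k} E_j x_j^r - q \Big)$$

and the error vector $e_k^r$ is defined by

$$e_k^r := x_k^r - x_k^{r+1} + \nabla_{x_k} \bar{L}\big( \{x_j^{r+1}\}_{j \le k}, \{x_j^r\}_{j>k}; y^r \big) - \nabla_{x_k} \bar{L}(x^r; y^r). \tag{3.6}$$

Note that we can bound the norm of $e_k^r$ as follows

$$
\begin{aligned}
\|e_k^r\| &\le \|x_k^r - x_k^{r+1}\| + \Big\| \nabla_{x_k} \bar{L}\big( \{x_j^{r+1}\}_{j \le k}, \{x_j^r\}_{j>k}; y^r \big) - \nabla_{x_k} \bar{L}(x^r; y^r) \Big\| \\
&= \|x_k^r - x_k^{r+1}\| + \Big\| A_k^T \big( \nabla g_k(A_k x_k^{r+1}) - \nabla g_k(A_k x_k^r) \big) + \rho E_k^T \sum_{j \le k} E_j (x_j^{r+1} - x_j^r) \Big\| \\
&\le c \|x^r - x^{r+1}\|,
\end{aligned} \tag{3.7}
$$

where the constant $c > 0$ is independent of $y^r$, and can take the same value as in (2.9).
Using (3.5), and by the definition of the proximity operator, we have the following

$$
\begin{aligned}
& h_k(x_k^{r+1}) + \langle x_k^{r+1} - x_k^r,\, \nabla_{x_k} \bar{L}(x^r; y^r) + e_k^r \rangle + \frac{1}{2} \|x_k^{r+1} - x_k^r\|^2 \\
&\quad \le h_k(\bar{x}_k^r) + \langle \bar{x}_k^r - x_k^r,\, \nabla_{x_k} \bar{L}(x^r; y^r) + e_k^r \rangle + \frac{1}{2} \|\bar{x}_k^r - x_k^r\|^2.
\end{aligned} \tag{3.8}
$$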
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Summing over all k = 1, • • • , K, we obtain 

/i(* r+1 ) + - x r , V*L(x r ;y r ) +e r ) + ^\\x r+1 - x r \\ 2 

< h(x r ) + (x r - x\V x L(x r ;y r ) + e r ) + i||x r - x r || 2 . 
Upon rearranging terms, we obtain 

h{x r+1 ) - h(x r ) + (x r+1 - x r ,V x L{x r ;y r )) < ^\\x r - x r \\ 2 - (x r+1 - x r ,e r ). (3.9) 

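Also, we have from the mean value theorem that there exists some $\tilde x$ in the line segment joining
$x^{r+1}$ and $\bar x^r$ such that

$$\bar L(x^{r+1};y^r) - \bar L(\bar x^r;y^r) = \langle\nabla_x\bar L(\tilde x;y^r), x^{r+1} - \bar x^r\rangle.$$

Using the above results, we can bound $\Delta_p^r$ by

$$\begin{aligned}
\Delta_p^r &= L(x^{r+1};y^r) - L(\bar x^r;y^r) \\
&= \bar L(x^{r+1};y^r) - \bar L(\bar x^r;y^r) + h(x^{r+1}) - h(\bar x^r) \\
&= \langle\nabla_x\bar L(\tilde x;y^r), x^{r+1} - \bar x^r\rangle + h(x^{r+1}) - h(\bar x^r) \\
&= \langle\nabla_x\bar L(\tilde x;y^r) - \nabla_x\bar L(x^r;y^r), x^{r+1} - \bar x^r\rangle + \langle\nabla_x\bar L(x^r;y^r), x^{r+1} - \bar x^r\rangle + h(x^{r+1}) - h(\bar x^r) \\
&\le \langle\nabla_x\bar L(\tilde x;y^r) - \nabla_x\bar L(x^r;y^r), x^{r+1} - \bar x^r\rangle + \frac{1}{2}\|\bar x^r - x^r\|^2 + c\sqrt{K}\,\|x^{r+1} - x^r\|\,\|x^{r+1} - \bar x^r\| \\
&\le \Big(\sum_k L\|A_k^T\|\|A_k\| + \rho\|E^TE\|\Big)\|\tilde x - x^r\|\,\|x^{r+1} - \bar x^r\| + \frac{1}{2}\|\bar x^r - x^r\|^2 + c\sqrt{K}\,\|x^{r+1} - x^r\|\,\|x^{r+1} - \bar x^r\| \\
&\le \Big(\sum_k L\|A_k^T\|\|A_k\| + \rho\|E^TE\|\Big)\left(\|x^{r+1} - x^r\| + \|\bar x^r - x^r\|\right)^2 + \frac{1}{2}\|\bar x^r - x^r\|^2 + c\sqrt{K}\,\|x^{r+1} - x^r\|\left(\|x^{r+1} - x^r\| + \|x^r - \bar x^r\|\right) \\
&\le \zeta\|x^{r+1} - x^r\|^2 + \zeta'\|\bar x^r - x^r\|^2,
\end{aligned}$$

where the first inequality follows from (3.9) and (3.7), the second from the Lipschitz continuity of $\nabla_x\bar L(\cdot\,;y^r)$, and the third is due to the fact that
$\tilde x$ lies in the line segment joining $x^{r+1}$ and $\bar x^r$, so that $\|\tilde x - x^r\| \le \|x^{r+1} - x^r\| + \|\bar x^r - x^r\|$. This
completes the proof. Q.E.D.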
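The inequality (3.8) only uses the fact that the proximity operator returns the minimizer of $h_k(u) + \langle u - x_k^r, v\rangle + \frac{1}{2}\|u - x_k^r\|^2$ over $u$. The following minimal Python sketch checks this numerically in the scalar case. It assumes $h_k = |\cdot|$ (so the proximity operator is soft thresholding); the values of `x` and `v` are hypothetical stand-ins for $x_k^r$ and $\nabla_{x_k}\bar L(x^r;y^r) + e_k^r$ and are not taken from the paper.

```python
def soft_threshold(v, t):
    """Proximity operator of t*|.| evaluated at the scalar v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def prox_objective(u, x, v):
    """h(u) + <u - x, v> + 0.5*(u - x)^2 with h = |.| (scalar case)."""
    return abs(u) + (u - x) * v + 0.5 * (u - x) ** 2


# Hypothetical data: x plays the role of x_k^r, v of the gradient plus error term.
x, v = 1.5, 0.8

# prox_h[x - v] is the point produced by the update (3.5).
u_star = soft_threshold(x - v, 1.0)

# Compare against a grid of competitors; in (3.8) the competitor is \bar x_k^r.
grid = [i / 100.0 for i in range(-300, 301)]
smallest_margin = min(prox_objective(u, x, v) - prox_objective(u_star, x, v) for u in grid)

print("prox point:", u_star)
print("smallest margin over the grid (should be >= 0):", smallest_margin)
```

Any competitor gives an objective value at least as large as the one at the prox point, which is exactly the statement (3.8) with the competitor chosen as $\bar x_k^r$.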

We then bound the decrease of the dual optimality gap.

Lemma 3.2 For each $r \ge 1$, there holds

$$\Delta_d^r - \Delta_d^{r-1} \le -\alpha(Ex^r - q)^T(E\bar x^r - q). \tag{3.10}$$

Proof. The reduction of the optimality gap in the dual space can be bounded as follows:

$$\begin{aligned}
\Delta_d^r - \Delta_d^{r-1} &= [d^* - d(y^r)] - [d^* - d(y^{r-1})] \\
&= d(y^{r-1}) - d(y^r) \\
&= L(\bar x^{r-1};y^{r-1}) - L(\bar x^r;y^r) \\
&= [L(\bar x^r;y^{r-1}) - L(\bar x^r;y^r)] + [L(\bar x^{r-1};y^{r-1}) - L(\bar x^r;y^{r-1})] \\
&= (y^{r-1} - y^r)^T(q - E\bar x^r) + [L(\bar x^{r-1};y^{r-1}) - L(\bar x^r;y^{r-1})] \\
&= -\alpha(Ex^r - q)^T(E\bar x^r - q) + [L(\bar x^{r-1};y^{r-1}) - L(\bar x^r;y^{r-1})] \\
&\le -\alpha(Ex^r - q)^T(E\bar x^r - q), \quad \forall\ r \ge 1,
\end{aligned}$$

where the last equality follows from the update of the dual variable, $y^r = y^{r-1} + \alpha(q - Ex^r)$, and the final inequality holds because $\bar x^{r-1}$ minimizes $L(\cdot\,;y^{r-1})$. Q.E.D.



Lemma 3.2 shows that if $q - Ex^r$ is close to the true dual gradient $\nabla d(y^r) = q - E\bar x^r$, then
the dual optimality gap is reduced after each ADMM iteration. However, since ADMM updates the
primal variable by only one Gauss-Seidel sweep, the primal iterate $x^r$ is not necessarily close to the
minimizer $\bar x^r$ of $L(x;y^r)$. Thus, unlike the method of multipliers (for which $x^r = \bar x^r$ for all $r$), there
is no guarantee that the dual optimality gap is indeed reduced after each iteration of ADMM.
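To make the contrast with the method of multipliers concrete, here is a short worked step, under the idealized assumption that the primal subproblem is solved exactly so that $x^r = \bar x^r$. In that case the right-hand side of (3.10) is a negative square:

```latex
% Idealized case x^r = \bar x^r (method of multipliers) substituted into (3.10):
\Delta_d^r - \Delta_d^{r-1}
  \le -\alpha\,(E\bar x^r - q)^T (E\bar x^r - q)
  = -\alpha\,\| E\bar x^r - q \|^2
  = -\alpha\,\| \nabla d(y^r) \|^2
  \le 0 .
```

For ADMM the cross term $(Ex^r - q)^T(E\bar x^r - q)$ has no definite sign in general, which is why the dual gap alone need not decrease.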

Next we proceed to bound the decrease in the primal gap $\Delta_p^r$.

Lemma 3.3 For each $r \ge 1$, we have

$$\Delta_p^r - \Delta_p^{r-1} \le \alpha\|Ex^r - q\|^2 - \gamma\|x^{r+1} - x^r\|^2 - \alpha(Ex^r - q)^T(E\bar x^r - q) \tag{3.11}$$

for some $\gamma$ independent of $y^r$.



Proof. Fix any $r \ge 1$. We have

$$L(x^r;y^{r-1}) = f(x^r) + \langle y^{r-1}, q - Ex^r\rangle + \frac{\rho}{2}\|Ex^r - q\|^2$$

and

$$L(x^{r+1};y^r) = f(x^{r+1}) + \langle y^r, q - Ex^{r+1}\rangle + \frac{\rho}{2}\|Ex^{r+1} - q\|^2.$$

By the update rule of $y^r$ (cf. (1.7)), we have

$$L(x^r;y^r) = f(x^r) + \langle y^{r-1} + \alpha(q - Ex^r), q - Ex^r\rangle + \frac{\rho}{2}\|Ex^r - q\|^2.$$

This implies

$$L(x^r;y^r) = L(x^r;y^{r-1}) + \alpha\|Ex^r - q\|^2.$$

Recall from Lemma 2.4 that the alternating minimization of the Lagrangian function gives a sufficient
descent. In particular, we have

$$L(x^{r+1};y^r) - L(x^r;y^r) \le -\gamma\|x^{r+1} - x^r\|^2,$$

for some $\gamma > 0$ that is independent of $r$ and $y^r$. Therefore, we have

$$L(x^{r+1};y^r) - L(x^r;y^{r-1}) \le \alpha\|Ex^r - q\|^2 - \gamma\|x^{r+1} - x^r\|^2, \quad \forall\ r \ge 1.$$

Hence, we have the following bound on the reduction of the primal optimality gap:

$$\begin{aligned}
\Delta_p^r - \Delta_p^{r-1} &= [L(x^{r+1};y^r) - d(y^r)] - [L(x^r;y^{r-1}) - d(y^{r-1})] \\
&= [L(x^{r+1};y^r) - L(x^r;y^{r-1})] - [d(y^r) - d(y^{r-1})] \\
&\le \alpha\|Ex^r - q\|^2 - \gamma\|x^{r+1} - x^r\|^2 - \alpha(Ex^r - q)^T(E\bar x^r - q), \quad \forall\ r \ge 1,
\end{aligned}$$

where the last step is due to Lemma 3.2. Q.E.D.
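The step in the proof above where the dual update raises the augmented Lagrangian by exactly $\alpha\|Ex^r - q\|^2$ can be checked with a few lines of arithmetic. The sketch below is illustrative only: the numbers for $f(x^r)$, the residual $q - Ex^r$, and $y^{r-1}$ are hypothetical, and the dual update is assumed to be $y^r = y^{r-1} + \alpha(q - Ex^r)$, as used in the proof of Lemma 3.2.

```python
def aug_lagrangian(f_val, y, resid, rho):
    """L(x; y) = f(x) + <y, q - Ex> + (rho/2)*||q - Ex||^2, with resid = q - Ex."""
    inner = sum(yi * ri for yi, ri in zip(y, resid))
    return f_val + inner + 0.5 * rho * sum(ri * ri for ri in resid)


# Hypothetical data (not from the paper).
f_val = 2.0            # value of f(x^r)
resid = [0.5, -1.0]    # residual q - E x^r
y_prev = [0.3, 0.7]    # y^{r-1}
alpha, rho = 0.25, 1.0

# Assumed dual update: y^r = y^{r-1} + alpha * (q - E x^r).
y_new = [yi + alpha * ri for yi, ri in zip(y_prev, resid)]

jump = aug_lagrangian(f_val, y_new, resid, rho) - aug_lagrangian(f_val, y_prev, resid, rho)
expected = alpha * sum(ri * ri for ri in resid)

print("L(x^r; y^r) - L(x^r; y^{r-1}) =", jump)
print("alpha * ||E x^r - q||^2       =", expected)
```

The two printed numbers agree up to rounding, which is the identity used to pass from $L(x^r;y^r)$ to $L(x^r;y^{r-1})$.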

Notice that when $\alpha = 0$ (i.e., no dual update in the ADMM algorithm), Lemma 3.3 reduces to
the sufficient decrease estimate (2.4) in Lemma 2.4. When $\alpha > 0$, the primal optimality gap is not
necessarily reduced after each ADMM iteration due to the positive term $\alpha\|Ex^r - q\|^2$ in (3.11).
Thus, in general, we cannot guarantee a consistent decrease of either the dual optimality gap $\Delta_d^r$
or the primal optimality gap $\Delta_p^r$. However, somewhat surprisingly, the sum of the primal and dual
optimality gaps decreases for all $r$, as long as the dual stepsize $\alpha$ is sufficiently small. This is used
to establish the linear convergence of the ADMM.
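To see the two gaps in action, the following is a minimal sketch on a toy problem, not the paper's setting in full generality. It assumes three scalar blocks with $f_k(x_k) = \frac{1}{2}(x_k - c_k)^2$ (so $h_k = 0$ and the problem is in fact strongly convex), a single linear constraint $\sum_k e_k x_k = q$, and the iteration order "dual step with stepsize $\alpha$, then one Gauss-Seidel sweep" inferred from the lemmas above. All function names and data are hypothetical. For this toy problem $d(y)$ and $d^*$ are available in closed form, so $\Delta_p^r + \Delta_d^r$ can be evaluated directly.

```python
def aug_lagrangian(x, y, c, e, q, rho):
    """L(x; y) for f(x) = sum_k 0.5*(x_k - c_k)^2 and the scalar constraint e.x = q."""
    resid = q - sum(ek * xk for ek, xk in zip(e, x))
    f_val = sum(0.5 * (xk - ck) ** 2 for xk, ck in zip(x, c))
    return f_val + y * resid + 0.5 * rho * resid ** 2


def dual_value(y, c, e, q, rho):
    """d(y) = min_x L(x; y); closed form because the toy objective is quadratic."""
    ee = sum(ek * ek for ek in e)
    ec = sum(ek * ck for ek, ck in zip(e, c))
    s = (ec + ee * (y + rho * q)) / (1.0 + rho * ee)   # s = e . x_bar(y)
    x_bar = [ck + ek * (y + rho * (q - s)) for ck, ek in zip(c, e)]
    return aug_lagrangian(x_bar, y, c, e, q, rho)


def run_admm(c, e, q, rho=1.0, alpha=0.1, iters=400):
    """Toy multi-block ADMM with dual stepsize alpha; returns x, y and the gap sums."""
    K = len(c)
    x, y = [0.0] * K, 0.0
    ee = sum(ek * ek for ek in e)
    ec = sum(ek * ck for ek, ck in zip(e, c))
    d_star = 0.5 * (q - ec) ** 2 / ee   # optimal value of the toy problem
    gap_sums = []
    for _ in range(iters):
        # Dual step: y^r = y^{r-1} + alpha * (q - E x^r).
        y += alpha * (q - sum(ek * xk for ek, xk in zip(e, x)))
        # One Gauss-Seidel sweep: exact minimization over each scalar block in turn.
        for k in range(K):
            others = sum(e[j] * x[j] for j in range(K) if j != k)
            x[k] = (c[k] + e[k] * y - rho * e[k] * (others - q)) / (1.0 + rho * e[k] ** 2)
        d_y = dual_value(y, c, e, q, rho)
        primal_gap = aug_lagrangian(x, y, c, e, q, rho) - d_y   # Delta_p^r
        dual_gap = d_star - d_y                                  # Delta_d^r
        gap_sums.append(primal_gap + dual_gap)
    return x, y, gap_sums


x, y, gap_sums = run_admm(c=[1.0, 2.0, 3.0], e=[1.0, 1.0, 1.0], q=3.0)
print("constraint residual:", abs(sum(x) - 3.0))
print("first and last gap sums:", gap_sums[0], gap_sums[-1])
```

The stepsize `alpha=0.1` is simply a small value chosen for illustration; it is not derived from the constants in the analysis.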

Theorem 3.1 Suppose Assumptions A, B, C hold and that the level set of $\Delta_d + \Delta_p$ is bounded,
i.e.,

$$\delta := \sup\left\{\|x\| + \|y\| \;\middle|\; [L(x;y) - d(y)] + [d^* - d(y)] \le \Delta_d^0 + \Delta_p^0\right\} < \infty. \tag{3.12}$$

Then the sequence of iterates $\{x^r, y^r\}$ generated by the ADMM algorithm (1.7) converges linearly to
an optimal primal-dual solution for (1.1), provided the stepsize $\alpha$ is sufficiently small. Moreover,
the sequence of function values $\{f(x^r)\}$ also converges linearly.



Proof. We show by induction that the sum of optimality gaps $\Delta_d^r + \Delta_p^r$ is reduced after each
ADMM iteration, as long as the stepsize $\alpha$ is chosen sufficiently small. For any $r \ge 1$, we denote

$$\bar x^r = \arg\min_{\bar x \in X(y^r)} \|\bar x - x^r\|. \tag{3.13}$$

By induction, suppose $\Delta_d^{r-1} + \Delta_p^{r-1} \le \Delta_d^0 + \Delta_p^0$ for some $r \ge 1$. Then $\|x^r\| \le \delta$, and it follows from
Assumption B that

$$\|x^r - \bar x^r\| \le \tau\|\tilde\nabla L(x^r;y^r)\| \tag{3.14}$$

for some $\tau > 0$ (independent of $y^r$). To prove Theorem 3.1, we combine the two estimates (3.10)
and (3.11) to obtain

$$\begin{aligned}
[\Delta_p^r + \Delta_d^r] - [\Delta_p^{r-1} + \Delta_d^{r-1}] &= [\Delta_p^r - \Delta_p^{r-1}] + [\Delta_d^r - \Delta_d^{r-1}] \\
&\le \alpha\|Ex^r - q\|^2 - \gamma\|x^{r+1} - x^r\|^2 - 2\alpha(Ex^r - q)^T(E\bar x^r - q) \\
&= \alpha\|Ex^r - E\bar x^r\|^2 - \alpha\|E\bar x^r - q\|^2 - \gamma\|x^{r+1} - x^r\|^2.
\end{aligned} \tag{3.15}$$

Now we invoke (3.14) and Lemma 2.5 to lower bound $\|x^{r+1} - x^r\|$:

$$\|x^r - \bar x^r\| \le \tau\|\tilde\nabla L(x^r;y^r)\| \le \tau\sigma\|x^{r+1} - x^r\|. \tag{3.16}$$

Substituting this bound into (3.15) yields

$$[\Delta_p^r + \Delta_d^r] - [\Delta_p^{r-1} + \Delta_d^{r-1}] \le \left(\alpha\|E\|^2\tau^2\sigma^2 - \gamma\right)\|x^{r+1} - x^r\|^2 - \alpha\|E\bar x^r - q\|^2. \tag{3.17}$$

Thus, if we choose the stepsize $\alpha$ sufficiently small so that

$$0 < \alpha < \gamma\,\tau^{-2}\sigma^{-2}\|E\|^{-2}, \tag{3.18}$$

then the above estimate shows that

$$[\Delta_p^r + \Delta_d^r] \le [\Delta_p^{r-1} + \Delta_d^{r-1}],$$

which completes the induction. Moreover, the induction argument shows that if the stepsize $\alpha$
satisfies the condition (3.18), then the descent condition (3.17) holds for all $r \ge 1$.
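The stepsize condition (3.18) is a one-line formula once the constants are known. The sketch below simply evaluates it; the constants $\gamma$, $\tau$, $\sigma$ and $\|E\|$ are problem dependent and generally not known explicitly, so the numbers used here are hypothetical placeholders and the helper names are illustrative.

```python
def max_dual_stepsize(gamma, tau, sigma, norm_E):
    """Upper bound on alpha from (3.18): gamma / (tau * sigma * ||E||)^2."""
    return gamma / (tau * sigma * norm_E) ** 2


def descent_coefficient(alpha, gamma, tau, sigma, norm_E):
    """Coefficient of ||x^{r+1} - x^r||^2 in (3.17); negative when (3.18) holds."""
    return alpha * norm_E ** 2 * tau ** 2 * sigma ** 2 - gamma


# Hypothetical constants (not taken from the paper).
gamma, tau, sigma, norm_E = 0.5, 2.0, 1.0, 1.0

alpha_max = max_dual_stepsize(gamma, tau, sigma, norm_E)
alpha = 0.5 * alpha_max   # any value strictly inside (0, alpha_max) satisfies (3.18)

print("admissible stepsizes: 0 < alpha <", alpha_max)
print("descent coefficient at alpha =", alpha, ":",
      descent_coefficient(alpha, gamma, tau, sigma, norm_E))
```

With a stepsize inside the admissible interval, both terms on the right-hand side of (3.17) are nonpositive, which is what the induction needs.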

Now we argue that the sequence $\{x^r, y^r\}$ converges to a primal-dual optimal solution pair for
the original optimization problem (1.1). By the descent estimate (3.17), we have

$$\|x^{r+1} - x^r\| \to 0, \quad \|E\bar x^r - q\| \to 0. \tag{3.19}$$

By the assumption (3.12) and the above induction argument, the sequence $\{x^r, y^r\}$ is bounded.
Let $(x^\infty, y^\infty)$ be any limit point of the sequence $\{x^r, y^r\}$ so that

$$\lim_{r\in\mathcal{R},\, r\to\infty} (x^r, y^r) = (x^\infty, y^\infty)$$

for some subsequence $\mathcal{R}$. By (3.3), we obtain

$$d^* - d(y^r) \le \tau'\|E\bar x^r - q\|^2, \quad r \in \mathcal{R}.$$

By taking the limit along this subsequence $\mathcal{R}$, we have

$$0 \le \lim_{r\in\mathcal{R},\, r\to\infty} \left(d^* - d(y^r)\right) \le \tau' \lim_{r\in\mathcal{R},\, r\to\infty} \|E\bar x^r - q\|^2 = 0.$$

This further implies that

$$\lim_{r\in\mathcal{R},\, r\to\infty} \left(d^* - d(y^r)\right) = d^* - d(y^\infty) = 0.$$

Thus, each limit point of the sequence $\{y^r\}$ is a dual optimal solution. Similarly, by taking the limit
in (3.16) and using (3.19), we obtain

$$\lim_{r\to\infty} \|x^r - \bar x^r\| = 0.$$

Thus, $\|x^\infty - \bar x^r\| \le \|x^r - x^\infty\| + \|x^r - \bar x^r\| \to 0$ as $r \to \infty$ and $r \in \mathcal{R}$. Since $\bar x^r \in X(y^r)$, it follows
that the limit point of $\{\bar x^r\}_{r\in\mathcal{R}}$ must be an element of $X(y^\infty)$. Thus, $x^\infty \in X(y^\infty)$, implying that $x^\infty$ is a
primal optimal solution for (1.1).

We now show that the sum of optimality gaps $\Delta_d^r + \Delta_p^r$ in fact contracts geometrically after
each ADMM iteration. To this end, we first notice that the condition

$$[L(x^{r+1};y^r) - d(y^r)] + [d^* - d(y^r)] = \Delta_d^r + \Delta_p^r \le \Delta_d^0 + \Delta_p^0$$

implies $\|x^{r+1}\| + \|y^r\| \le \delta$ for all $r \ge 0$. This further implies $\|x^r\| \le \delta$ and $\|y^r\| \le \delta$ for all $r \ge 1$.
Therefore, it follows from (3.3) of Lemma 3.1 that we have the following cost-to-go estimate

$$\Delta_d^r = d^* - d(y^r) \le \tau'\|\nabla d(y^r)\|^2 = \tau'\|E\bar x^r - q\|^2, \tag{3.20}$$

for some $\tau' > 0$ and for all $r \ge 1$.

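Moreover, we can use Lemma 3.1 to bound $\|x^{r+1} - x^r\|^2$ from below by $\Delta_p^r$. In particular, we
have from (3.16) and Lemma 3.1 that

$$\begin{aligned}
\Delta_p^r &\le \zeta\|x^{r+1} - x^r\|^2 + \zeta'\|x^r - \bar x^r\|^2 \\
&\le \zeta\|x^{r+1} - x^r\|^2 + \zeta'\tau^2\sigma^2\|x^{r+1} - x^r\|^2 \\
&= \left(\zeta + \zeta'\tau^2\sigma^2\right)\|x^{r+1} - x^r\|^2.
\end{aligned}$$

Substituting this bound and (3.20) into (3.17), and assuming that $\alpha > 0$ satisfies (3.18), we obtain

$$\begin{aligned}
[\Delta_p^r + \Delta_d^r] - [\Delta_p^{r-1} + \Delta_d^{r-1}] &\le \left(\alpha\|E\|^2\tau^2\sigma^2 - \gamma\right)\|x^{r+1} - x^r\|^2 - \alpha\|E\bar x^r - q\|^2 \\
&\le -\frac{\gamma - \alpha\|E\|^2\tau^2\sigma^2}{\zeta + \zeta'\tau^2\sigma^2}\,\Delta_p^r - \alpha(\tau')^{-1}\Delta_d^r \\
&\le -\min\left\{\frac{\gamma - \alpha\|E\|^2\tau^2\sigma^2}{\zeta + \zeta'\tau^2\sigma^2},\ \alpha(\tau')^{-1}\right\}[\Delta_p^r + \Delta_d^r].
\end{aligned}$$

When $\alpha > 0$ is chosen small enough such that (3.18) holds, we have

$$\lambda := \min\left\{\frac{\gamma - \alpha\|E\|^2\tau^2\sigma^2}{\zeta + \zeta'\tau^2\sigma^2},\ \alpha(\tau')^{-1}\right\} > 0.$$

Consequently, we have

$$[\Delta_p^r + \Delta_d^r] - [\Delta_p^{r-1} + \Delta_d^{r-1}] \le -\lambda[\Delta_p^r + \Delta_d^r],$$

which further implies

$$0 \le [\Delta_p^r + \Delta_d^r] \le \frac{1}{1+\lambda}[\Delta_p^{r-1} + \Delta_d^{r-1}].$$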

This shows that the sequence $\{\Delta_p^r + \Delta_d^r\}$ is Q-linearly convergent to zero. As a result, we conclude
that both $\Delta_p^r$ and $\Delta_d^r$ converge to zero R-linearly. By the inequality (3.17), we can further conclude
that

$$\|x^{r+1} - x^r\|^2 \to 0, \quad \|E\bar x^r - q\| \to 0$$

R-linearly. It then follows that the primal sequence $\{x^r\}$ converges R-linearly to a primal optimal
solution.
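The contraction factor $1/(1+\lambda)$ translates directly into an iteration count. The sketch below evaluates $\lambda$ from its definition as the minimum of $(\gamma - \alpha\|E\|^2\tau^2\sigma^2)/(\zeta + \zeta'\tau^2\sigma^2)$ and $\alpha/\tau'$, and then computes how many iterations shrink the gap sum by a given factor. All constants are hypothetical placeholders chosen so that (3.18) holds; the function names are illustrative.

```python
import math


def contraction_modulus(alpha, gamma, tau, sigma, norm_E, zeta, zeta_prime, tau_prime):
    """lambda = min{(gamma - alpha*||E||^2*tau^2*sigma^2) / (zeta + zeta'*tau^2*sigma^2), alpha/tau'}."""
    ts2 = (tau * sigma) ** 2
    first = (gamma - alpha * norm_E ** 2 * ts2) / (zeta + zeta_prime * ts2)
    second = alpha / tau_prime
    return min(first, second)


def iterations_to_tolerance(lam, tol):
    """Smallest n with (1 / (1 + lam))**n <= tol, assuming lam > 0 and 0 < tol < 1."""
    return math.ceil(math.log(1.0 / tol) / math.log(1.0 + lam))


# Hypothetical constants (not taken from the paper); alpha satisfies (3.18) for these values.
lam = contraction_modulus(alpha=0.05, gamma=0.5, tau=2.0, sigma=1.0, norm_E=1.0,
                          zeta=1.0, zeta_prime=1.0, tau_prime=1.0)
n = iterations_to_tolerance(lam, 1e-6)

print("lambda =", lam, "contraction factor =", 1.0 / (1.0 + lam))
print("iterations to shrink the gap sum by 1e-6:", n)
```

Since $\lambda$ is at most $\alpha/\tau'$, a very small dual stepsize keeps the iteration provably convergent but makes the guaranteed rate slow.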

We next show that the dual iterate sequence $\{y^r\}$ is also R-linearly convergent. To this end,
notice that the inequalities (3.16) and (3.17) imply

$$[\Delta_p^r + \Delta_d^r] - [\Delta_p^{r-1} + \Delta_d^{r-1}] \le \left(\alpha\|E\|^2 - \gamma\tau^{-2}\sigma^{-2}\right)\|x^r - \bar x^r\|^2 - \alpha\|E\bar x^r - q\|^2. \tag{3.21}$$

Then by (3.21), we see that both $\|x^r - \bar x^r\| \to 0$ and $\|E\bar x^r - q\| \to 0$ R-linearly. This further implies
that $Ex^r - q \to 0$ R-linearly. By the ADMM dual update formula (1.7), the linear convergence
of $\|Ex^r - q\| \to 0$ implies that $\|y^{r+1} - y^r\| \to 0$ R-linearly. Thus, the dual iterate sequence $\{y^r\}$
is R-linearly convergent. Furthermore, by part (a), the limit of the iterate sequence $\{x^r, y^r\}$ must
exist and it is a primal-dual optimal solution pair for (1.1).

Finally, since

$$\begin{aligned}
f(x^{r+1}) - d^* &= [f(x^{r+1}) - d(y^r)] + [d(y^r) - d^*] \\
&= [L(x^{r+1};y^r) - d(y^r)] - [d^* - d(y^r)] - \langle y^r, q - Ex^{r+1}\rangle - \frac{\rho}{2}\|Ex^{r+1} - q\|^2 \\
&= \Delta_p^r - \Delta_d^r - \langle y^r, q - Ex^{r+1}\rangle - \frac{\rho}{2}\|Ex^{r+1} - q\|^2
\end{aligned}$$

and

$$\Delta_p^r \to 0, \quad \Delta_d^r \to 0, \quad \|q - Ex^r\| \to 0$$

linearly, it follows that $f(x^{r+1}) - d^* \to 0$ R-linearly (recall that $y^r$ is bounded by $\delta$). The proof is
complete. Q.E.D.

An immediate corollary of Theorem 3.1 and Lemma 2.2 is that the ADMM algorithm is globally
linearly convergent if the objective function $f$ satisfies either of the two conditions of Lemma 2.2.
In particular, it implies the linear convergence of the ADMM for several contemporary applications
including LASSO, Group LASSO and Sparse Group LASSO without any strong convexity
assumption.

4 Variants of ADMM



The convergence analysis of Section 3 can be extended to some variants of the ADMM. We briefly 
describe two of them below. 



4.1 Proximal ADMM 

In the original ADMM (1.7), each block $x_k$ is updated by solving a convex optimization subproblem
exactly. For large scale problems, this subproblem may not be easy to solve unless the matrix $E_k$
is unitary (i.e., $E_k^TE_k = I$), in which case the variables in $x_k$ can be further decoupled (assuming $f_k$
is separable). If the matrix $E_k$ is not unitary, we can still employ a simple proximal gradient step
to inexactly minimize $L(x_1^{r+1}, \dots, x_{k-1}^{r+1}, x_k, x_{k+1}^r, \dots, x_K^r)$. More specifically, we update each block
$x_k$ according to the following procedure:

$$\begin{aligned}
x_k^{r+1} = \arg\min_{x_k} \Big\{ & h_k(x_k) + \langle y^r, q - E_kx_k\rangle + \langle A_k^T\nabla g_k(A_kx_k^r), x_k - x_k^r\rangle + \frac{\beta}{2}\|x_k - x_k^r\|^2 \\
& + \Big\langle \rho E_k^T\Big(\sum_{j<k} E_jx_j^{r+1} + \sum_{j\ge k} E_jx_j^r - q\Big), x_k - x_k^r\Big\rangle \Big\},
\end{aligned} \tag{4.22}$$

in which the smooth part of the objective function in the $k$-th subproblem, namely,

$$g_k(A_kx_k) + \langle y^r, q - E_kx_k\rangle + \frac{\rho}{2}\Big\|E_kx_k + \sum_{j<k} E_jx_j^{r+1} + \sum_{j>k} E_jx_j^r - q\Big\|^2,$$

is linearized locally at $x_k^r$, and a proximal term $\frac{\beta}{2}\|x_k - x_k^r\|^2$ is added. Here, $\beta > 0$ is a positive
constant. With this change, updating $x_k$ is easy when $h_k$ (the nonsmooth part of $f_k$) is separable.
For example, this is the case for compressive sensing applications where $h_k(x_k) = \|x_k\|_1$, and the
resulting subproblem admits a closed form solution given by the component-wise soft thresholding
(also known as the shrinkage operator).
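For the $\ell_1$ case just mentioned, the update (4.22) amounts to a gradient step on the linearized smooth part followed by soft thresholding with threshold $1/\beta$. The following is a minimal sketch under that assumption, with $h_k = \|\cdot\|_1$ and no extra weight on the norm. The function names are hypothetical, and `grad_k` stands for the gradient of the smooth part at $x_k^r$, namely $-E_k^Ty^r + A_k^T\nabla g_k(A_kx_k^r) + \rho E_k^T(\sum_{j<k}E_jx_j^{r+1} + \sum_{j\ge k}E_jx_j^r - q)$, which the caller is assumed to supply.

```python
def soft_threshold(v, t):
    """Component-wise soft thresholding: the proximity operator of t * ||.||_1."""
    out = []
    for vi in v:
        if vi > t:
            out.append(vi - t)
        elif vi < -t:
            out.append(vi + t)
        else:
            out.append(0.0)
    return out


def prox_admm_l1_block(x_k, grad_k, beta):
    """One block update of the form (4.22) when h_k = ||.||_1.

    Minimizes ||u||_1 + <grad_k, u - x_k> + (beta/2)*||u - x_k||^2 over u,
    whose solution is soft_threshold(x_k - grad_k / beta, 1 / beta).
    """
    shifted = [xi - gi / beta for xi, gi in zip(x_k, grad_k)]
    return soft_threshold(shifted, 1.0 / beta)


# Hypothetical block iterate and gradient (not from the paper).
x_new = prox_admm_l1_block(x_k=[1.0, -0.2, 0.5], grad_k=[0.5, 0.1, -1.0], beta=2.0)
print(x_new)
```

The per-block cost is linear in the block size, since no linear system involving $E_k$ has to be solved.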

We claim that Theorem 3.1 holds for the proximal ADMM algorithm. Indeed, to establish the
(linear) convergence of the proximal ADMM (4.22), we can follow the same proof steps as those for
Theorem 3.1, with the only changes being in the proofs of Lemmas 2.4-2.5 and Lemma 3.1. To see why
Lemma 2.4 holds, we just need to argue that there is a sufficient descent:

$$L(x^{r+1};y^r) - L(x^r;y^r) \le -\gamma\|x^{r+1} - x^r\|^2, \quad \text{for some } \gamma > 0 \text{ independent of } y^r. \tag{4.23}$$

This property can be seen by bounding the smooth part of $L(x_1^{r+1}, \dots, x_{k-1}^{r+1}, x_k, x_{k+1}^r, \dots, x_K^r; y^r)$, which
is given by

$$\bar L_k(x_k) := g_k(A_kx_k) + \langle y^r, q - E_kx_k\rangle + \frac{\rho}{2}\Big\|\sum_{j<k} E_jx_j^{r+1} + \sum_{j>k} E_jx_j^r + E_kx_k - q\Big\|^2,$$

with the Taylor expansion at $x_k^r$:

$$\bar L_k(x_k^{r+1}) \le \bar L_k(x_k^r) + \langle\nabla\bar L_k(x_k^r), x_k^{r+1} - x_k^r\rangle + \frac{\nu}{2}\|x_k^{r+1} - x_k^r\|^2, \tag{4.24}$$

where

$$\nu := L\|A_k\|\|A_k^T\| + \rho\|E_k^TE_k\|$$

is the Lipschitz constant of $\nabla\bar L_k(\cdot)$ and $L$ is the Lipschitz constant of $\nabla g_k(\cdot)$. Making the above
inequality more explicit yields

$$\begin{aligned}
& L(x_1^{r+1}, \dots, x_{k-1}^{r+1}, x_k^{r+1}, x_{k+1}^r, \dots, x_K^r; y^r) - L(x_1^{r+1}, \dots, x_{k-1}^{r+1}, x_k^r, x_{k+1}^r, \dots, x_K^r; y^r) \\
&\quad \le h_k(x_k^{r+1}) - h_k(x_k^r) + \langle y^r, E_k(x_k^r - x_k^{r+1})\rangle + \langle A_k^T\nabla g_k(A_kx_k^r), x_k^{r+1} - x_k^r\rangle \\
&\qquad + \Big\langle \rho E_k^T\Big(\sum_{j<k} E_jx_j^{r+1} + \sum_{j\ge k} E_jx_j^r - q\Big), x_k^{r+1} - x_k^r\Big\rangle + \frac{\nu}{2}\|x_k^{r+1} - x_k^r\|^2 \\
&\quad \le -\frac{\beta}{2}\|x_k^{r+1} - x_k^r\|^2 + \frac{\nu}{2}\|x_k^{r+1} - x_k^r\|^2 \\
&\quad = -\gamma\|x_k^{r+1} - x_k^r\|^2, \quad \forall\ k,
\end{aligned} \tag{4.25}$$

provided the regularization parameter $\beta$ satisfies

$$\gamma := \frac{1}{2}(\beta - \nu) > 0.$$

In the above derivation of (4.25), the first step is due to (4.24), while the second inequality follows
from the definition of $x_k^{r+1}$ (cf. (4.22)). Summing (4.25) over all $k$ yields the desired estimate of
sufficient descent (4.23).
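The second inequality in (4.25) can be spelled out as follows. This is a sketch that introduces the shorthand $\Phi_k$ (a hypothetical symbol, not used in the paper) for the objective minimized in (4.22); since $x_k^{r+1}$ minimizes $\Phi_k$, its value there is no larger than its value at $x_k^r$, where the linearization and proximal terms vanish.

```latex
% \Phi_k denotes the objective of (4.22); v_k^r collects the two linearization vectors.
v_k^r := A_k^T \nabla g_k(A_k x_k^r)
       + \rho E_k^T \Big( \sum_{j<k} E_j x_j^{r+1} + \sum_{j\ge k} E_j x_j^r - q \Big),
\qquad
\Phi_k(x_k) := h_k(x_k) + \langle y^r, q - E_k x_k \rangle
             + \langle v_k^r, x_k - x_k^r \rangle + \frac{\beta}{2}\| x_k - x_k^r \|^2 .

% Optimality of x_k^{r+1}:
\Phi_k(x_k^{r+1}) \le \Phi_k(x_k^r) = h_k(x_k^r) + \langle y^r, q - E_k x_k^r \rangle

% Rearranging gives the bound used in (4.25):
h_k(x_k^{r+1}) - h_k(x_k^r) + \langle y^r, E_k (x_k^r - x_k^{r+1}) \rangle
  + \langle v_k^r, x_k^{r+1} - x_k^r \rangle
  \le -\frac{\beta}{2}\| x_k^{r+1} - x_k^r \|^2 .
```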

To verify that Lemma 2.5 still holds for the proximal ADMM algorithm, we note from the
corresponding optimality condition for (4.22) that

$$x_k^{r+1} = \operatorname{prox}_{h_k}\Big[x_k^{r+1} - A_k^T\nabla g_k(A_kx_k^r) + E_k^Ty^r - \rho E_k^T\Big(\sum_{j<k} E_jx_j^{r+1} + \sum_{j\ge k} E_jx_j^r - q\Big) - \beta(x_k^{r+1} - x_k^r)\Big].$$

Using this relation in place of (2.8) and following the same proof steps, we can easily prove that
the bound (2.7) in Lemma 2.5 can be extended to the proximal ADMM algorithm. Thus, the
convergence results in Theorem 3.1 remain true for the proximal ADMM algorithm (4.22).

It remains to verify that Lemma 3.1 still holds true. In fact, the first part of Lemma 3.1 can
be shown to be independent of the iterates, thus it trivially holds true for the proximal ADMM
algorithm. To show that the second part of Lemma 3.1 is true, note that the optimality condition
of the proximal ADMM algorithm implies that

$$\begin{aligned}
x_k^{r+1} &= \operatorname{prox}_{h_k}\left[x_k^{r+1} - \nabla_{x_k}\bar L\left(\{x_j^{r+1}\}_{j<k}, \{x_j^r\}_{j\ge k}; y^r\right) - \beta(x_k^{r+1} - x_k^r)\right] \\
&:= \operatorname{prox}_{h_k}\left[x_k^r - \nabla_{x_k}\bar L(x^r;y^r) - e_k^r\right],
\end{aligned}$$

where in this case $e_k^r$ is given as

$$e_k^r := x_k^r - x_k^{r+1} + \nabla_{x_k}\bar L\left(\{x_j^{r+1}\}_{j<k}, \{x_j^r\}_{j\ge k}; y^r\right) - \nabla_{x_k}\bar L(x^r;y^r) + \beta(x_k^{r+1} - x_k^r).$$

It is then straightforward to show that the norm of $e_k^r$ can be bounded by $c\,\|x^r - x^{r+1}\|$ for some
constant $c > 0$. The rest of the proof follows the same steps as in Lemma 3.1.



4.2 Jacobi Update 



Another popular variant of the ADMM algorithm is to use a Jacobi iteration (instead of a Gauss-Seidel
iteration) to update the primal variable blocks $\{x_k\}$. In particular, the ADMM iteration
(1.7) is modified as follows:

$$x_k^{r+1} = \arg\min_{x_k}\ h_k(x_k) + g_k(A_kx_k) - \langle y^r, E_kx_k\rangle + \frac{\rho}{2}\Big\|E_kx_k + \sum_{j\ne k} E_jx_j^r - q\Big\|^2, \quad \forall\ k. \tag{4.26}$$



The convergence for this direct Jacobi scheme is unclear, as the augmented Lagrangian function
may not decrease after each Jacobi update. In the following, we consider a modified Jacobi
scheme with an explicit stepsize control. Specifically, let us introduce an intermediate variable
$w = (w_1^T, \dots, w_K^T)^T \in \mathbb{R}^n$. The modified Jacobi update is given as follows:

$$w_k^{r+1} = \arg\min_{x_k}\ h_k(x_k) + g_k(A_kx_k) - \langle y^r, E_kx_k\rangle + \frac{\rho}{2}\Big\|E_kx_k + \sum_{j\ne k} E_jx_j^r - q\Big\|^2, \quad \forall\ k, \tag{4.27}$$

$$x_k^{r+1} = x_k^r + \frac{1}{K}\left(w_k^{r+1} - x_k^r\right), \quad \forall\ k, \tag{4.28}$$

where a stepsize of $1/K$ is used in the update of each variable block.

With this modification, we claim that Lemmas 2.4-2.5 and Lemma 3.1 still hold. In particular,
Lemma 2.4 can be argued as follows. The strong convexity of $L(x;y)$ with respect to the variable
block $x_k$ implies that

$$L(x_1^r, \dots, x_{k-1}^r, x_k^r, x_{k+1}^r, \dots, x_K^r; y^r) - L(x_1^r, \dots, x_{k-1}^r, w_k^{r+1}, x_{k+1}^r, \dots, x_K^r; y^r) \ge \gamma\|w_k^{r+1} - x_k^r\|^2, \quad \forall\ k.$$

Using this inequality we obtain

$$\begin{aligned}
L(x^r;y^r) - L(x^{r+1};y^r) &= L(x^r;y^r) - L\Big(\frac{K-1}{K}x^r + \frac{1}{K}w^{r+1}; y^r\Big) \\
&= L(x^r;y^r) - L\Big(\frac{1}{K}\sum_{k=1}^K \left(x_1^r, \dots, x_{k-1}^r, w_k^{r+1}, x_{k+1}^r, \dots, x_K^r\right); y^r\Big) \\
&\ge \frac{1}{K}\sum_{k=1}^K \left(L(x^r;y^r) - L(x_1^r, \dots, x_{k-1}^r, w_k^{r+1}, x_{k+1}^r, \dots, x_K^r; y^r)\right) \\
&\ge \frac{1}{K}\sum_{k=1}^K \gamma\|w_k^{r+1} - x_k^r\|^2 \\
&= \frac{\gamma}{K}\|w^{r+1} - x^r\|^2,
\end{aligned}$$

where the first inequality comes from the convexity of the augmented Lagrangian function.

From the update rule (4.28) we have $K(x_k^{r+1} - x_k^r) = (w_k^{r+1} - x_k^r)$, which combined with the
previous inequality yields

$$L(x^r;y^r) - L(x^{r+1};y^r) \ge \gamma K\|x^{r+1} - x^r\|^2.$$
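The damped Jacobi step can be tried on a toy instance. The sketch below assumes scalar blocks with $h_k = 0$, $g_k(x_k) = \frac{1}{2}(x_k - c_k)^2$, $A_k = 1$ and a single constraint $\sum_k e_kx_k = q$, so that each block subproblem has a closed-form minimizer; the data and function names are hypothetical. Each $w_k$ is computed from the old iterate only, the blocks are then moved by the stepsize $1/K$, and the decrease of the augmented Lagrangian is evaluated.

```python
def aug_lagrangian(x, y, c, e, q, rho):
    """L(x; y) for f(x) = sum_k 0.5*(x_k - c_k)^2 and the scalar constraint e.x = q."""
    resid = q - sum(ek * xk for ek, xk in zip(e, x))
    f_val = sum(0.5 * (xk - ck) ** 2 for xk, ck in zip(x, c))
    return f_val + y * resid + 0.5 * rho * resid ** 2


def damped_jacobi_step(x, y, c, e, q, rho):
    """Block minimizers w_k computed in parallel, then a step of size 1/K toward w."""
    K = len(x)
    total = sum(ek * xk for ek, xk in zip(e, x))
    w = []
    for k in range(K):
        others = total - e[k] * x[k]   # every block sees the old iterate x^r
        w.append((c[k] + e[k] * y - rho * e[k] * (others - q)) / (1.0 + rho * e[k] ** 2))
    x_next = [xk + (wk - xk) / K for xk, wk in zip(x, w)]
    return w, x_next


# Hypothetical toy data (not from the paper).
c, e, q, rho, y = [1.0, 2.0, 3.0], [1.0, 1.0, 1.0], 3.0, 1.0, 0.0
x = [0.0, 0.0, 0.0]

w, x_next = damped_jacobi_step(x, y, c, e, q, rho)
decrease = aug_lagrangian(x, y, c, e, q, rho) - aug_lagrangian(x_next, y, c, e, q, rho)

print("w =", w)
print("x_next =", x_next)
print("decrease of the augmented Lagrangian:", decrease)
```

The decrease is positive here, in line with the descent inequality derived above; the averaging over $K$ blocks is what allows convexity to be applied.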

The proof of Lemma 2.5 also requires only minor modifications. In particular, we have the
following optimality condition for (4.27):

$$w_k^{r+1} = \operatorname{prox}_{h_k}\Big[w_k^{r+1} - A_k^T\nabla g_k(A_kw_k^{r+1}) + E_k^Ty^r - \rho E_k^T\Big(\sum_{j\ne k} E_jx_j^r + E_kw_k^{r+1} - q\Big)\Big].$$

Similar to the proof of Lemma 2.5, we have

$$\left\|w_k^{r+1} - \operatorname{prox}_{h_k}\left[x_k^r - A_k^T\nabla g_k(A_kx_k^r) + E_k^Ty^r - \rho E_k^T(Ex^r - q)\right]\right\| \le c\,\|w^{r+1} - x^r\|.$$

Utilizing the relationship $K(x_k^{r+1} - x_k^r) = (w_k^{r+1} - x_k^r)$, we can establish Lemma 2.5 by following
similar proof steps (which we omit for reasons of space).

Lemma 3.1 can be shown as follows. We first express $w_k^{r+1}$ as

$$\begin{aligned}
w_k^{r+1} &= \operatorname{prox}_{h_k}\left[w_k^{r+1} - \nabla_{x_k}\bar L\left(\{x_j^r\}_{j\ne k}, w_k^{r+1}; y^r\right)\right] \\
&:= \operatorname{prox}_{h_k}\left[x_k^r - \nabla_{x_k}\bar L(x^r;y^r) - e_k^r\right],
\end{aligned}$$

where we have defined

$$e_k^r := \nabla_{x_k}\bar L\left(\{x_j^r\}_{j\ne k}, w_k^{r+1}; y^r\right) - \nabla_{x_k}\bar L(x^r;y^r) + x_k^r - w_k^{r+1}.$$

Again by using the relationship $K(x_k^{r+1} - x_k^r) = (w_k^{r+1} - x_k^r)$, we can bound the norm of $e_k^r$ by
$c'\|x^{r+1} - x^r\|$, for some $c' > 0$. The remaining proof steps are similar to those in Lemma 3.1.

Since Lemmas 2.4-2.5 and Lemma 3.1 hold for the Jacobi version of the ADMM algorithm with
a stepsize control, we conclude that the convergence results of Theorem 3.1 remain true in this
case.

5 Concluding Remarks 

In this paper we have established the convergence and the rate of convergence of the classical
ADMM algorithm when the number of variable blocks is more than two and in the absence of
strong convexity. Our analysis is a departure from the conventional analysis of the ADMM algorithm,
which relies on the descent of a weighted (semi-)norm of $(x^r - x^*, y^r - y^*)$; see [T7H2T)ll2l)ll26y 29 P 5].
In our analysis, we require neither the strong convexity of the objective function nor the row
independence assumption of the constraint matrix $E$. Instead, we use a local error bound to show
that the sum of the primal and the dual optimality gaps decreases geometrically after each ADMM
iteration, although individually each gap may increase.

Acknowledgement: The authors are grateful to Dr. Min Tao of Nanjing University for her 
constructive comments. 